Analyzing the Dynamics of Film Revenue: A Statistical Approach

Author

Shashanka Jagadeesh: 23101892

Published

April 4, 2024

Business Implication: How can filmmakers and distributors leverage insights from statistical analyses of budget, scores, and content themes to strategize for higher box-office returns?


Libraries

The code chunk below installs and loads the R packages needed for data manipulation, visualization, and analysis, ensuring that all required libraries are available for the subsequent analyses. “CRAN - Available Packages by Name” (n.d.)

Code
options(repos = c(CRAN = "https://cran.rstudio.com/"))
if (!require("tm", quietly = TRUE)) install.packages("tm", quiet = TRUE)
if (!require("wordcloud", quietly = TRUE)) install.packages("wordcloud", quiet = TRUE)
if (!require("RColorBrewer", quietly = TRUE)) install.packages("RColorBrewer", quiet = TRUE)
if (!require("leaflet", quietly = TRUE)) install.packages("leaflet", quiet = TRUE)
if (!require("countrycode", quietly = TRUE)) install.packages("countrycode", quiet = TRUE)
if (!require("mgcv", quietly = TRUE)) install.packages("mgcv", quiet = TRUE)
if (!require("maps", quietly = TRUE)) install.packages("maps", quiet = TRUE)
if (!require("ggplot2", quietly = TRUE)) install.packages("ggplot2", quiet)
install.packages(c("readr", "corrplot", "reshape2", "gbm", "randomForest"))


Code
library(ggplot2)
library(maps)
library(randomForest)
library(caret)
library(gbm)
library(mgcv)
library(readr)
library(dplyr)
library(tidyverse)
library(corrplot)
library(tm)
library(wordcloud)
library(RColorBrewer)
library(tidyr)
library(leaflet)
library(countrycode)
library(plotly)
library(reshape2)
library(stats)

Loading all datasets

Here, datasets are loaded into R from the specified paths. Each dataset corresponds to a different source of movie titles. (K. n.d.; B. n.d., n.d.)

Code
disney_plus_titles <- read.csv("C:/Users/PC/Documents/GitHub/RAssignment/disney_plus_titles.csv", fileEncoding = "UTF-8")
netflix_titles <- read.csv("C:/Users/PC/Documents/GitHub/RAssignment/netflix_titles.csv", fileEncoding = "UTF-8")
imdb_movies <- read.csv("C:/Users/PC/Documents/GitHub/RAssignment/imdb_movies.csv", fileEncoding = "UTF-8")

Mermaid

This Mermaid diagram provides a visual flowchart of how the datasets will be merged. It outlines the keys and columns in each dataset, leading to the formation of the final dataset. Mermaid: Generation of Diagram and Flowchart from Text in a Similar Manner as Markdown (n.d.)

graph TD;
      classDef dataset fill:#007bff,stroke:#333,stroke-width:2px,color:#000;
      classDef process fill:#28a745,stroke:#333,stroke-width:2px,color:#000;
      classDef final fill:#dc3545,stroke:#333,stroke-width:4px,color:#000;

    A("Netflix Titles\n- Key: title\n- Columns: type,title,director,cast");
    B("Disney Plus Titles\n- Key: title\n- Columns: type,title,director,cast");
    D("IMDb Movies\n- Key: title\n- Columns: title,score,genre,overview");

    C["Merging Process\n---\nKey: title"];
    E("final_dataset\n- Key: title\n- Merged Columns: type,title,director,cast,score,genre,overview");

    A --> C;
    B --> C;
    D --> C;
    C --> E;

    class A,B,D dataset;
    class C process;
    class E final;

Merging datasets

The following code filters movies from the titles datasets, adds a source identifier, combines them, and merges them with the IMDb movies dataset. The resulting final_dataset is then cleaned and prepped for analysis.

Code
disney_plus_movies <- filter(disney_plus_titles, type == "Movie")
netflix_movies <- filter(netflix_titles, type == "Movie")

disney_plus_movies$source <- 'Disney+'
netflix_movies$source <- 'Netflix'

# Combining the two datasets
combined_movies <- bind_rows(disney_plus_movies, netflix_movies)

# Removing duplicate movie titles
combined_movies_unique <- combined_movies %>%
  distinct(title, .keep_all = TRUE)

final_dataset <- merge(combined_movies_unique, imdb_movies, by = "title", all.x = TRUE)

final_dataset <- na.omit(final_dataset)

# Median-imputation template for any remaining NA column (no column named
# example_column exists in this dataset, so the call is left commented out):
# final_dataset$example_column[is.na(final_dataset$example_column)] <- median(final_dataset$example_column, na.rm = TRUE)

final_dataset <- unique(final_dataset)

#write.csv(final_dataset, "C:/Users/PC/Documents/Assignments/R programming/Assignment 2/Datasets/final_dataset1.csv", row.names = FALSE)

final_dataset <- select(final_dataset, -orig_title)
#final_dataset <- select(final_dataset, -overview)
final_dataset$title <- sapply(final_dataset$title, function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))
final_dataset$genre <- sapply(final_dataset$genre, function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = ""))


final_dataset$budget_x[is.na(final_dataset$budget_x)] <- median(final_dataset$budget_x, na.rm = TRUE)
final_dataset$revenue[is.na(final_dataset$revenue)] <- median(final_dataset$revenue, na.rm = TRUE)

head(final_dataset,5)
                                         title show_id  type
7                                       #Alive   s2037 Movie
23                  10 Things I Hate About You    s708 Movie
31                              101 Dalmatians    s709 Movie
32 101 Dalmatians II: Patch's London Adventure    s710 Movie
33                              102 Dalmatians    s711 Movie
                                              director
7                                               Cho Il
23                                          Gil Junger
31 Wolfgang Reitherman, Hamilton Luske, Clyde Geronimi
32                           Jim Kammerud, Brian Smith
33                                          Kevin Lima
                                                                                               cast
7                                                                          Yoo Ah-in, Park Shin-hye
23 Heath Ledger, Julia Stiles, Joseph Gordon-Levitt, Larisa Oleynik, David Krumholtz, Andrew Keegan
31                  Rod Taylor, J. Pat O'Malley, Betty Gerson, Cate Bauer, Ben Wright, Fred Worlock
32    Barry Bostwick, Jason Alexander, Martin Short, Bobby Lockwood, Susanne Blakeslee, Samuel West
33         Glenn Close, Ioan Gruffudd, Alice Evans, Tim McInnerny, Ian Richardson, Gérard Depardieu
                       country.x        date_added release_year rating duration
7                    South Korea September 8, 2020         2020  TV-MA   99 min
23                 United States November 12, 2019         1999  PG-13   98 min
31                 United States November 12, 2019         1961      G   81 min
32                 United States November 12, 2019         2003      G   77 min
33 United States, United Kingdom November 12, 2019         2000      G  104 min
                                        listed_in
7  Horror Movies, International Movies, Thrillers
23                 Comedy, Coming of Age, Romance
31            Action-Adventure, Animation, Family
32                    Action-Adventure, Animation
33     Action-Adventure, Animals & Nature, Family
                                                                                                                                             description
7  As a grisly virus rampages a city, a lone man stays locked inside his apartment, digitally cut off from seeking help and desperate to find a way out.
23                                                                               Cameron falls for the girl of his dreams, but she is forbidden to date.
31                                                                                        Cruella De Vil dognaps all of the Dalmatian puppies in London.
32                                                                                    Pongo and Perdita's pup Patch gets the chance to meet his TV hero.
33                                                                              Oddball, the spotless Dalmatian puppy, heads out in search of his spots.
    source     date_x score                             genre
7  Disney+  6/24/2020    73  Horror,Action,Adventure,Thriller
23 Netflix  3/30/1999    76              Comedy,Romance,Drama
31 Netflix  12/5/1996    59                     Family,Comedy
32 Netflix 11/12/2002    60 Family,Adventure,Comedy,Animation
33 Netflix 11/21/2000    55                     Family,Comedy
                                                                                                                                                                                                                                                                                                                                                                                                            overview
7                                                                                                                                                                                                                                                              As a grisly virus rampages a city, a lone man stays locked inside his apartment, digitally cut off from seeking help and desperate to find a way out.
23 On the first day at his new school, Cameron instantly falls for Bianca, the gorgeous girl of his dreams. The only problem is that Bianca is forbidden to date until her ill-tempered, completely un-dateable older sister Kat goes out, too. In an attempt to solve his problem, Cameron singles out the only guy who could possibly be a match for Kat: a mysterious bad boy with a nasty reputation of his own.
31                                                                                                                                                                                                                                                                An evil, high-fashion designer plots to steal Dalmatian puppies in order to make an extravagant fur coat, but instead creates an extravagant mess.
32                                                                                                                                                                                                                     Being one of 101 takes its toll on Patch, who doesn't feel unique. When he's accidentally left behind on moving day, he meets his idol, Thunderbolt, who enlists him on a publicity campaign.
33           Get ready for a howling good time as an all new assortment of irresistible animal heroes are unleashed in this great family tail! In an unlikely alliance, the outrageous Waddlesworth - a parrot who thinks he's a Rottweiler - teams up with Oddball - an un-marked Dalmatian puppy eager to earn her spots! Together they embark on a laugh-packed quest to outwit the ever-scheming Cruella De Vil.
                                                                                                                                                                                                                                                                                             crew
7                                 Yoo Ah-in, Oh Joon-woo, Park Shin-hye, Kim Yoo-bin, Lee Hyun-wook, Sang-chul, Jin So-yeon, Elena Kim, Kim Hak-sun, Joon-woo's Father, So Hee-jung, Joon-woo's Mother, Joo Bo-bi, Joon-woo's Sister, Jeon Bae-soo, Masked Man, Lee Chae-kyung, Masked Man’s Wife
23       Julia Stiles, Katarina Stratford, Heath Ledger, Patrick Verona, Joseph Gordon-Levitt, Cameron James, Larisa Oleynik, Bianca Stratford, David Krumholtz, Michael Eckman, Andrew Keegan, Joey Donner, Susan May Pratt, Mandella, Gabrielle Union, Chastity, Larry Miller, Walter Stratford
31                                                   Glenn Close, Cruella De Vil, Jeff Daniels, Roger Dearly, Joely Richardson, Anita Campbell-Green, Joan Plowright, Nanny, Hugh Laurie, Jasper, Mark Williams, Horace, John Shrapnel, L. Skinner, Tim McInnerny, Alonzo, Hugh Fraser, Frederick
32 Barry Bostwick, Thunderbolt (voice), Jason Alexander, Lightning (voice), Martin Short, Lars (voice), Bobby Lockwood, Patch (voice), Susanne Blakeslee, Cruella (voice), Samuel West, Pongo (voice), Maurice LaMarche, Horace (voice), Jeff Bennett, Jasper (voice), Jodi Benson, Anita (voice)
33                                      Glenn Close, Cruella de Vil, Ioan Gruffudd, Kevin Shepherd, Alice Evans, Chloe Simon, Tim McInnerny, Alonzo, Gérard Depardieu, Jean Pierre Le Pelt, Eric Idle, Waddlesworth (voice), Ben Crompton, Ewan, Carol MacReady, Agnes, Ian Richardson, Mr. Torte
      status orig_lang  budget_x   revenue country.y
7   Released    Korean   6300000  13416285        KR
23  Released   English  13000000  60413950        AU
31  Released   English  54000000 320689294        AU
32  Released   English 131600000 520075508        AU
33  Released   English  85000000  66941559        US

Checking correlation before analysis

This section creates a correlation matrix to understand the linear relationships between numerical variables. A heatmap is plotted to visualize these relationships, aiding in the assessment of potential predictor variables for the upcoming models.

Code
#Selecting only numerical columns for the correlation matrix
numerical_data <- final_dataset[, c("release_year", "budget_x", "revenue", "score")]

# Calculate the correlation matrix
cor_matrix <- cor(numerical_data, use = "pairwise.complete.obs")  # Handles missing values

# Melt the correlation matrix for use with ggplot
melted_cor_matrix <- melt(cor_matrix)

# Plotting the heatmap
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Pearson\nCorrelation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 12, hjust = 1),
        axis.text.y = element_text(size = 12)) +
  labs(x = '', y = '', title = 'Heatmap of Correlation Matrix') +
  coord_fixed()

Code
cor_matrix <- cor(final_dataset[, c("release_year", "budget_x", "revenue", "score")], 
                  use = "complete.obs")

# Optionally, round the correlation matrix for easier viewing
cor_matrix_rounded <- round(cor_matrix, 2)

# Print the rounded correlation matrix
print(cor_matrix_rounded)
             release_year budget_x revenue score
release_year         1.00     0.18    0.13 -0.07
budget_x             0.18     1.00    0.72 -0.03
revenue              0.13     0.72    1.00  0.12
score               -0.07    -0.03    0.12  1.00
Code
cor_matrix_df <- as.data.frame(cor_matrix_rounded)

# View the data frame
View(cor_matrix_df)

The correlation coefficients between revenue and the predictor variables (budget, release year, and score) are relatively low. This suggests that there may not be strong linear relationships between these variables. However, correlation analysis only assesses linear relationships, and it’s possible that nonlinear relationships exist.
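As a quick check for such monotone nonlinear relationships, a rank-based (Spearman) correlation can be compared against Pearson's r. The sketch below uses synthetic data, not the movie dataset, purely to illustrate the diagnostic:

```r
# Sketch: Spearman vs. Pearson on synthetic data with a monotone
# nonlinear relationship. A Spearman coefficient near 1 alongside a
# noticeably lower Pearson coefficient hints that nonlinear models
# (such as the GAMs fitted later) may be worthwhile.
set.seed(42)
x <- runif(500, 1, 100)
y <- x^3                       # strongly nonlinear but monotone in x

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
cat("Pearson:", round(pearson, 2), " Spearman:", round(spearman, 2), "\n")
```

Because the relationship is perfectly monotone, Spearman's coefficient is 1 while Pearson's is lower, reflecting the curvature that a linear measure misses.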

Quantitative Analysis

GLM Regression model

A Generalized Linear Model (GLM) with a Gaussian family and identity link, equivalent here to ordinary linear regression, is fitted as a baseline for predicting revenue. The summary and predictions provide insight into the model’s performance.

Code
set.seed(123) # for reproducibility
index <- createDataPartition(final_dataset$revenue, p = 0.7, list = FALSE)
train_data <- final_dataset[index, ]
test_data <- final_dataset[-index, ]

# training data
glm_model_no_genre <- glm(revenue ~ budget_x + release_year + score, 
                          data = train_data, 
                          family = gaussian())

# Summary of the GLM model
summary(glm_model_no_genre)

Call:
glm(formula = revenue ~ budget_x + release_year + score, family = gaussian(), 
    data = train_data)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -4.975e+08  6.727e+08  -0.739    0.460    
budget_x      3.444e+00  9.315e-02  36.969  < 2e-16 ***
release_year  1.101e+05  3.340e+05   0.330    0.742    
score         4.525e+06  6.059e+05   7.467 1.48e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 3.36963e+16)

    Null deviance: 9.4348e+19  on 1335  degrees of freedom
Residual deviance: 4.4883e+19  on 1332  degrees of freedom
AIC: 54640

Number of Fisher Scoring iterations: 2
Code
predictions <- predict(glm_model_no_genre, newdata = test_data, type = "response")

# Evaluate the model performance
test_data$predicted_revenue <- predictions
rmse <- sqrt(mean((test_data$revenue - test_data$predicted_revenue)^2))
cat("RMSE:", rmse, "\n")
RMSE: 173741295 
Code
postResample(pred = predictions, obs = test_data$revenue)
        RMSE     Rsquared          MAE 
1.737413e+08 5.655436e-01 1.208809e+08 
Code
glm_metrics <- postResample(pred = predictions, obs = test_data$revenue)
rmse_glm <- glm_metrics[1]
rsq_glm <- glm_metrics[2]

cat("GLM Accuracy (R-Squared):", round(glm_metrics['Rsquared'] * 100, 2), "%\n")
GLM Accuracy (R-Squared): 56.55 %
Code
cat("GLM RMSE:", round(glm_metrics['RMSE'], 2), "\n")
GLM RMSE: 173741295 

The model’s RMSE (Root Mean Square Error) is quite high, which might indicate that the model’s predictions are not very close to the actual values on average, potentially due to the high variance of movie revenues. The R-squared value is approximately 0.565, which means that around 56.5% of the variability in revenue is explained by the model. McCullagh and Nelder (1989)
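One way to put a large-looking RMSE in context is to compare it against the RMSE of a naive baseline that always predicts the mean revenue. The sketch below uses hypothetical stand-in numbers; in the report, `test_data$revenue` and the GLM predictions would take their place:

```r
# Stand-in actual and predicted revenues (hypothetical values, on the
# same order of magnitude as the dataset's revenue column).
actual    <- c(1.2e8, 3.4e7, 5.6e8, 9.0e7, 2.1e8)
predicted <- c(1.0e8, 5.0e7, 4.8e8, 1.2e8, 1.9e8)

rmse_model    <- sqrt(mean((actual - predicted)^2))
rmse_baseline <- sqrt(mean((actual - mean(actual))^2))  # mean-only model

# A model RMSE well below the baseline indicates genuine predictive
# value even when the absolute number looks large.
cat("Model RMSE:", rmse_model, " Baseline RMSE:", rmse_baseline, "\n")
```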

GAM Regression model

While GAMs offer flexibility, they also remain interpretable. A GAM models nonlinear relationships through smooth functions, which can capture more complex patterns between variables. The summary output shows the estimated smooth function for each predictor, allowing us to understand how each variable influences the response (revenue) nonlinearly. Hastie and Tibshirani (1990)

Code
set.seed(123) # for reproducibility
index <- createDataPartition(final_dataset$revenue, p = 0.7, list = FALSE)
train_data <- final_dataset[index, ]
test_data <- final_dataset[-index, ]

# Fit the GAM model on the training data
gam_model <- gam(revenue ~ s(budget_x) + s(release_year) + s(score), 
                 data = train_data, 
                 family = gaussian())

# Summary of the GAM model
summary(gam_model)

Family: gaussian 
Link function: identity 

Formula:
revenue ~ s(budget_x) + s(release_year) + s(score)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 230352588    4891994   47.09   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                  edf Ref.df       F p-value    
s(budget_x)     7.095  8.144 176.332  <2e-16 ***
s(release_year) 2.670  3.304   3.184  0.0186 *  
s(score)        5.252  6.249  11.474  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.548   Deviance explained = 55.3%
GCV = 3.2361e+16  Scale est. = 3.1973e+16  n = 1336
Code
# Predict on the test data
predictions <- predict(gam_model, newdata = test_data)

# Evaluate the model performance
test_data$predicted_revenue <- predictions
rmse <- sqrt(mean((test_data$revenue - test_data$predicted_revenue)^2))
cat("RMSE:", rmse, "\n")
RMSE: 168717953 
Code
gam_metrics <- postResample(pred = predictions, obs = test_data$revenue)
rmse_gam <- gam_metrics[1]
rsq_gam <- gam_metrics[2]


cat("GAM Accuracy (R-Squared):", round(gam_metrics['Rsquared'] * 100, 2), "%\n")
GAM Accuracy (R-Squared): 59.07 %
Code
cat("GAM RMSE:", round(gam_metrics['RMSE'], 2), "\n")
GAM RMSE: 168717953 

The significance of the smooth terms for budget_x and score indicates that these variables have important nonlinear effects on revenue, which the GAM is capturing. The model’s predictive performance is reasonable, but as with many real-world scenarios, there is room for improvement, likely due to the complex factors driving movie revenues beyond what’s captured in the model.
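The estimated smooths themselves can be visualized with mgcv’s plot() method. The sketch below fits a GAM to mgcv’s built-in simulated example data (gamSim), since the movie dataset is not reloaded in this chunk; the report’s gam_model would be plotted the same way:

```r
library(mgcv)

# Simulated example data shipped with mgcv (not the movie dataset).
set.seed(1)
dat <- gamSim(1, n = 200, verbose = FALSE)

m <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat)

# One panel per smooth term; shaded bands are +/- 2 standard errors.
plot(m, pages = 1, shade = TRUE)
```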

2nd Iteration

Enhancing the model with additional terms, such as a tensor-product interaction (ti()) between log(budget_x) and score, aims to refine its ability to explain the variation in revenue. The logarithmic transformation log(budget_x) addresses a potential diminishing-returns effect, where larger budgets do not necessarily lead to proportionally higher revenues, while the interaction term ti(log(budget_x), score) lets the effect of budget on revenue vary with the score.

Code
gam_model_enhanced <- gam(revenue ~ s(log(budget_x)) + s(release_year) + s(score) + ti(log(budget_x), score),
                          data = train_data, family = gaussian(),
                          method = "REML")

# Summary of the enhanced GAM model
summary(gam_model_enhanced)

Family: gaussian 
Link function: identity 

Formula:
revenue ~ s(log(budget_x)) + s(release_year) + s(score) + ti(log(budget_x), 
    score)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 230686841    4918138   46.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                          edf Ref.df       F p-value    
s(log(budget_x))        6.922  7.958 163.554  <2e-16 ***
s(release_year)         2.603  3.220   2.561  0.0494 *  
s(score)                4.114  5.020  13.898  <2e-16 ***
ti(log(budget_x),score) 8.474  9.380   7.628  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.557   Deviance explained = 56.5%
-REML =  27206  Scale est. = 3.128e+16  n = 1336
Code
# Predict on the test data
predictions_enhanced <- predict(gam_model_enhanced, newdata = test_data)

# Evaluate the model performance on the test data
test_data$predicted_revenue_enhanced <- predictions_enhanced
rmse_enhanced <- sqrt(mean((test_data$revenue - test_data$predicted_revenue_enhanced)^2))
cat("Enhanced Model RMSE:", rmse_enhanced, "\n")
Enhanced Model RMSE: 164575876 
Code
gam_enhanced_metrics <- postResample(pred = predictions_enhanced, obs = test_data$revenue)
rmse_gam_enhanced <- gam_enhanced_metrics[1]
rsq_gam_enhanced <- gam_enhanced_metrics[2]

cat("Enhanced Model RMSE:", rmse_enhanced, "\n")
Enhanced Model RMSE: 164575876 
Code
cat("Enhanced Model R-squared:", rsq_gam_enhanced, "\n")
Enhanced Model R-squared: 0.6110777 

The second iteration of the GAM has improved the model’s explanatory power and predictive accuracy. The inclusion of logged budget values and an interaction term appears to provide a better understanding of how budget and score together influence movie revenue.

Random forest

A Random Forest model is built to predict revenue. Random Forest is a robust machine learning method that can capture complex interactions between variables.

Code
set.seed(123) # for reproducibility
index <- createDataPartition(final_dataset$revenue, p = 0.7, list = FALSE)
train_data <- final_dataset[index, ]
test_data <- final_dataset[-index, ]

# Fit the Random Forest model on the training data
rf_model <- randomForest(revenue ~ budget_x + release_year + score, 
                         data = train_data,
                         ntree = 500,
                         importance = TRUE)

# Summarize the model
print(rf_model)

Call:
 randomForest(formula = revenue ~ budget_x + release_year + score,      data = train_data, ntree = 500, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 2.99077e+16
                    % Var explained: 57.65
Code
# Variable importance
var_imp <- importance(rf_model)
print(var_imp)
               %IncMSE IncNodePurity
budget_x     116.62451  5.508127e+19
release_year  20.06603  1.202675e+19
score         19.55355  1.327588e+19
Code
# Predict on the test data
predictions <- predict(rf_model, newdata = test_data)

# Evaluate the model performance on the test data
test_data$predicted_revenue <- predictions
rmse_rf <- sqrt(mean((test_data$revenue - test_data$predicted_revenue)^2))
cat("Random Forest Model RMSE:", rmse_rf, "\n")
Random Forest Model RMSE: 165871030 
Code
# You may need to calculate the residuals for each movie
test_data$residuals <- test_data$revenue - test_data$predicted_revenue

# Aggregate residuals by release year
aggregate_resid <- aggregate(residuals ~ release_year, data = test_data, FUN = mean)

# Create a data frame for aggregate residuals
aggregate_resid_df <- as.data.frame(aggregate_resid)


# Creating an interactive plot using plotly
interactive_plot <- plot_ly(data = aggregate_resid_df, x = ~release_year, y = ~residuals, type = 'scatter', mode = 'lines+markers') %>%
  layout(title = "Average Residuals by Release Year",
         xaxis = list(title = "Release Year"),
         yaxis = list(title = "Average Residuals"))

# Display the plot
interactive_plot
Code
rf_metrics <- postResample(pred = predictions, obs = test_data$revenue)
rmse_rf <- rf_metrics[1]
rsq_rf <- rf_metrics[2]


cat("Random Forest Accuracy (R-Squared):", round(rf_metrics['Rsquared'] * 100, 2), "%\n")
Random Forest Accuracy (R-Squared): 60.68 %
Code
cat("Random Forest RMSE:", round(rf_metrics['RMSE'], 2), "\n")
Random Forest RMSE: 165871030 

From this we see that budget is the strongest predictor among the variables used. The variance explained is high for a target as complex as movie revenue, but the large RMSE shows considerable remaining error in the predictions. This is not uncommon when predicting financial figures, which are highly variable and influenced by many factors not captured in the model.
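Variable importance can also be shown graphically with randomForest::varImpPlot(). Sketched here on the built-in mtcars data for self-containedness; the report’s rf_model would be passed in the same way:

```r
library(randomForest)

set.seed(123)
# Illustrative fit on built-in data; importance = TRUE is required for
# the permutation-based %IncMSE measure.
rf_demo <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

# Left panel: %IncMSE (permutation importance); right panel: node purity.
varImpPlot(rf_demo, main = "Variable importance")
```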

Summarizing Regression Analysis

Code
# summary table
accuracy_summary <- data.frame(
  Model = c("GLM", "GAM", "GAM 2nd Iteration", "Random Forest"),
  RMSE = c(rmse_glm, rmse_gam, rmse_gam_enhanced, rmse_rf),
  R_Squared = c(rsq_glm, rsq_gam, rsq_gam_enhanced, rsq_rf)
)

print(accuracy_summary)
              Model      RMSE R_Squared
1               GLM 173741295 0.5655436
2               GAM 168717953 0.5907282
3 GAM 2nd Iteration 164575876 0.6110777
4     Random Forest 165871030 0.6067559

Classification and Confusion Matrix

This section demonstrates how to convert a continuous variable into a binary classification and then build a Random Forest classification model to predict these classes.

Code
# Convert revenue to a binary variable
median_revenue <- median(final_dataset$revenue, na.rm = TRUE)
final_dataset$revenue_class <- ifelse(final_dataset$revenue > median_revenue, "High", "Low")

set.seed(123)

# Splitting the dataset into training and testing sets
splitIndex <- createDataPartition(final_dataset$revenue_class, p = .8, list = FALSE, times = 1)
train_data <- final_dataset[splitIndex, ]
test_data <- final_dataset[-splitIndex, ]

# Prepare training control
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)

# Train the model using only the training set
rf_model <- train(revenue_class ~ budget_x + release_year + score + orig_lang,
                  data = train_data,
                  method = "rf",
                  trControl = fitControl,
                  metric = "ROC")

# Model Summary
print(rf_model)
Random Forest 

1526 samples
   4 predictor
   2 classes: 'High', 'Low' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 1373, 1374, 1374, 1374, 1374, 1373, ... 
Resampling results across tuning parameters:

  mtry  ROC        Sens       Spec     
   2    0.8716120  0.7487867  0.9117908
  13    0.9063002  0.8107712  0.8742082
  24    0.9031652  0.8077466  0.8571315

ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 13.
Code
predictions <- predict(rf_model, newdata = test_data)
predictions_factor <- factor(predictions, levels = c("Low", "High"))
actual_factor <- factor(test_data$revenue_class, levels = c("Low", "High"))

confMatrix <- confusionMatrix(predictions_factor, actual_factor)
print(confMatrix)
Confusion Matrix and Statistics

          Reference
Prediction Low High
      Low  170   40
      High  20  150
                                          
               Accuracy : 0.8421          
                 95% CI : (0.8015, 0.8773)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.6842          
                                          
 Mcnemar's Test P-Value : 0.01417         
                                          
            Sensitivity : 0.8947          
            Specificity : 0.7895          
         Pos Pred Value : 0.8095          
         Neg Pred Value : 0.8824          
             Prevalence : 0.5000          
         Detection Rate : 0.4474          
   Detection Prevalence : 0.5526          
      Balanced Accuracy : 0.8421          
                                          
       'Positive' Class : Low             
                                          

The Random Forest classification model demonstrates good performance in distinguishing between the two revenue classes. The ROC value is high, suggesting that the model has a strong ability to discriminate between ‘High’ and ‘Low’ revenue movies. However, there is a trade-off between sensitivity and specificity as mtry changes, which is common in classification tasks: the model identifies ‘High’ revenue movies best (as indicated by higher sensitivity) when mtry is set to 13.
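The statistics printed above can also be extracted from the caret confusionMatrix object programmatically, which is convenient for reporting. A small sketch with hand-built hypothetical factors matching the High/Low structure:

```r
library(caret)

# Hypothetical predictions and observations using the same factor
# levels as revenue_class (not the report's actual test results).
pred <- factor(c("Low", "Low", "High", "High", "Low"), levels = c("Low", "High"))
obs  <- factor(c("Low", "High", "High", "High", "Low"), levels = c("Low", "High"))

cm <- confusionMatrix(pred, obs)
cm$overall["Accuracy"]     # overall accuracy as a named numeric
cm$byClass["Sensitivity"]  # sensitivity for the positive class ("Low" here)
```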

Qualitative analysis

Wordcloud

A word cloud is generated to visualize the most common words in the overview column of the dataset. This qualitative analysis can give insights into prevalent themes or topics. Silge and Robinson (n.d.)

Code
corpus <- Corpus(VectorSource(final_dataset$overview))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Build a term-document matrix
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
word_freqs_df <- data.frame(word = names(word_freqs), freq = word_freqs)

set.seed(1234)
wordcloud(words = word_freqs_df$word, freq = word_freqs_df$freq, min.freq = 1,
          max.words = 100, random.order = FALSE, rot.per = 0.35, 
          colors = brewer.pal(8, "Dark2"), main = "Most Common Words in Movie Overviews")

Impact of budget, critical scores, and use of specific words in the overview on a film’s revenue

Code
analysis_dataset <- final_dataset[, c("revenue", "budget_x", "release_year", "score", "overview")]

analysis_dataset$overview <- as.character(analysis_dataset$overview)

# Create indicator variables for the presence of keywords "young", "life", "new" in the new dataset
keywords <- c("young", "life", "new")
for(keyword in keywords) {
  analysis_dataset[[keyword]] <- as.numeric(grepl(keyword, analysis_dataset$overview, ignore.case = TRUE))
}

# Remove rows with NA 
analysis_dataset <- na.omit(analysis_dataset)


full_model_analysis <- lm(revenue ~ budget_x + release_year + score + young + life + new, data = analysis_dataset)
summary(full_model_analysis)

Call:
lm(formula = revenue ~ budget_x + release_year + score + young + 
    life + new, data = analysis_dataset)

Residuals:
       Min         1Q     Median         3Q        Max 
-724192214  -90006204  -24664889   63577766 2037166907 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -6.351e+08  5.637e+08  -1.127    0.260    
budget_x      3.471e+00  7.677e-02  45.216   <2e-16 ***
release_year  1.746e+05  2.798e+05   0.624    0.533    
score         4.581e+06  4.917e+05   9.317   <2e-16 ***
young         1.496e+06  1.203e+07   0.124    0.901    
life          6.988e+06  1.161e+07   0.602    0.547    
new           1.605e+07  1.187e+07   1.353    0.176    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 180700000 on 1899 degrees of freedom
Multiple R-squared:  0.5369,    Adjusted R-squared:  0.5355 
F-statistic:   367 on 6 and 1899 DF,  p-value: < 2.2e-16

The analysis underscores the critical roles that a movie’s budget and score play in determining its revenue, confirming the intuitive expectation that films with higher budgets and better scores (likely indicative of higher quality or more favorable reviews) tend to achieve greater financial success. The thematic keywords “young,” “life,” and “new,” however, show no statistical significance: once budget, release year, and score are accounted for, these terms have no detectable effect on revenue. Either the themes they signal are not potent determinants of a film’s financial outcome, or their effects are moderated by variables not examined here.
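One caveat on the keyword indicators: `grepl` with a plain string matches substrings, so the “new” flag also fires inside words like “newspaper” or “renewed”. A word-boundary pattern restricts matching to the whole word; the sketch below uses made-up overview text for illustration:

```r
overviews <- c("A newspaper editor starts over.",
               "A new life begins in the city.",
               "Renewed hope for a young family.")

# Plain substring match, as in the analysis chunk: also fires on "newspaper" and "Renewed"
substring_hit <- grepl("new", overviews, ignore.case = TRUE)

# Word-boundary match: only whole-word occurrences of "new"
boundary_hit <- grepl("\\bnew\\b", overviews, ignore.case = TRUE)

substring_hit  # TRUE TRUE TRUE
boundary_hit   # FALSE TRUE FALSE
```

Re-running the regression with boundary-matched indicators would be a quick robustness check on the keyword coefficients.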

Word Association

Associations between the term “life” and other words in the overview are examined. A bar chart is then created to visualize the strength of these associations.

Code
text_corpus <- Corpus(VectorSource(final_dataset$overview))


text_corpus <- tm_map(text_corpus, content_transformer(tolower)) # Convert to lowercase
text_corpus <- tm_map(text_corpus, removePunctuation)            # Remove punctuation
text_corpus <- tm_map(text_corpus, removeNumbers)                # Remove numbers
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english")) # Remove stopwords
text_corpus <- tm_map(text_corpus, stripWhitespace)              # Remove extra white spaces


tdm <- TermDocumentMatrix(text_corpus)

associations <- findAssocs(tdm, "life", 0.25) 

print(associations)
$life
        laura     acclaimed thirtyyearold  unapologetic          void 
         0.28          0.27          0.27          0.27          0.27 
 “iterations”       laura’s        andrea 
         0.27          0.26          0.25 
Code
#-----------------Visualisation-------------------------------------------

# Data
terms <- c("laura", "acclaimed", "thirtyyearold", "unapologetic", "void",
           "iterations", "laura’s", "andrea", "exploring", "desires",
           "fredricksen", "persistent", "deepest", "horribly", "realities",
           "everyday", "womanizer", "lease", "fullest", "longs", "many")

values <- c(0.28, 0.27, 0.27, 0.27, 0.27, 0.27, 0.26, 0.25, 0.23, 0.22,
            0.21, 0.21, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.18, 0.17, 0.16)

# bar chart (plot_ly is from the plotly package, which is not loaded above)
library(plotly)
fig <- plot_ly(x = ~terms, y = ~values, type = 'bar', marker = list(color = 'indianred'))
fig <- fig %>% layout(title = 'Terms Associated with "life" in Overviews',
                      xaxis = list(title = 'Terms', tickangle = -45),
                      yaxis = list(title = 'Association Strength'))

fig

Examining how “life” is discussed in the movie overviews reveals links between “life” and terms such as “laura,” “acclaimed,” “thirtyyearold,” “unapologetic,” “void,” “iterations,” “laura’s,” and “andrea,” with association strengths between 0.25 and 0.28. This pattern suggests that overviews mentioning “life” often describe personal stories and self-reflection, possibly centred on specific individuals such as a character named Laura. Descriptors like “acclaimed” and “unapologetic” appearing near “life” may indicate stories that are widely recognized or that explore themes of authentic self-expression and discovery. The analysis thus offers a deeper view of how “life” functions in the overviews, pointing to narratives about the intricacies of life and the characters navigating them.
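Rather than retyping the association scores into the `terms` and `values` vectors by hand, the output of `findAssocs` can be reshaped into a data frame directly. A minimal sketch, with a mock list standing in for the real `findAssocs(tdm, "life", 0.25)` result:

```r
# Mock object in the same shape findAssocs() returns:
# a named list of named numeric vectors
associations <- list(life = c(laura = 0.28, acclaimed = 0.27, andrea = 0.25))

assoc_df <- data.frame(term  = names(associations$life),
                       score = as.numeric(associations$life),
                       row.names = NULL)
assoc_df <- assoc_df[order(-assoc_df$score), ]  # strongest associations first
```

The resulting `assoc_df` can be passed straight to `plot_ly(data = assoc_df, x = ~term, y = ~score, type = 'bar')`, which keeps the chart in sync if the corpus or threshold changes.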

Geographical data analysis using Leaflet:

Geospatial analysis is performed to visualize the data on a world map. Two types of visualizations are created: a heatmap and cluster markers, both providing geographical insights into the dataset. “Leaflet: An Open-Source JavaScript Library for Mobile-Friendly Interactive Maps” (n.d.)

Code
# Convert ISO country codes to country names
final_dataset$country_name <- countrycode(final_dataset$country.y, "iso2c", "country.name")

# world map data, which includes lat/lon for country centers
world_map <- map_data("world")

# unique list of countries with their mean latitude and longitude
country_coords <- world_map %>%
  group_by(region) %>%
  summarize(lat = mean(lat), lon = mean(long), .groups = 'drop')

# Merge the coordinates
final_dataset <- merge(final_dataset, country_coords, by.x = "country_name", by.y = "region", all.x = TRUE)

na_count_country_name <- sum(is.na(final_dataset$country_name))
cat("Number of NA values in country_name column:", na_count_country_name, "\n")
Number of NA values in country_name column: 0 
Code
na_count_country_code <- sum(is.na(final_dataset$country.y))
cat("Number of NA values in country.y column:", na_count_country_code, "\n")
Number of NA values in country.y column: 0 
Code
# Flag rows with missing coordinates (patched manually below)
invalid_rows <- which(is.na(final_dataset$lat) | is.na(final_dataset$lon))

final_dataset <- final_dataset %>%
  mutate(
    lat = ifelse(country.y == "US", 37.0902, lat),
    lon = ifelse(country.y == "US", -95.7129, lon),
    lat = ifelse(country.y == "GB", 55.3781, lat),
    lon = ifelse(country.y == "GB", -3.4360, lon),
    lat = ifelse(country.y == "HK", 22.3193, lat),
    lon = ifelse(country.y == "HK", 114.1694, lon)
  )

#Heatmap
if (!require("leaflet.extras", quietly = TRUE)) install.packages("leaflet.extras", quiet = TRUE)
Code
library("leaflet.extras")
leaflet(data = final_dataset) %>%
  addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addHeatmap(lng = ~lon, lat = ~lat, intensity = ~1, radius = 20, blur = 15) %>%
  setView(lng = mean(final_dataset$lon, na.rm = TRUE), 
          lat = mean(final_dataset$lat, na.rm = TRUE), 
          zoom = 2)
Code
#Cluster Markers
leaflet(data = final_dataset) %>%
  addTiles() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(lng = ~lon, lat = ~lat, popup = ~country_name, 
                   clusterOptions = markerClusterOptions()) %>%
  setView(lng = mean(final_dataset$lon, na.rm = TRUE), 
          lat = mean(final_dataset$lat, na.rm = TRUE), 
          zoom = 2)
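The chain of `ifelse()` calls used above to patch missing coordinates grows by two lines per country. A lookup table is easier to extend; this base-R sketch uses a small synthetic data frame in place of `final_dataset`:

```r
# Synthetic stand-in for final_dataset: country codes with some missing coordinates
films <- data.frame(country.y = c("US", "GB", "HK", "FR"),
                    lat = c(NA, NA, NA, 46.2),
                    lon = c(NA, NA, NA, 2.2))

# One row per country that needs a manual coordinate fix
fixes <- data.frame(country.y = c("US", "GB", "HK"),
                    fix_lat = c(37.0902, 55.3781, 22.3193),
                    fix_lon = c(-95.7129, -3.4360, 114.1694))

idx <- match(films$country.y, fixes$country.y)  # NA where no fix applies
films$lat <- ifelse(is.na(idx), films$lat, fixes$fix_lat[idx])
films$lon <- ifelse(is.na(idx), films$lon, fixes$fix_lon[idx])
```

Adding another country then means adding one row to `fixes` rather than two more `ifelse()` lines.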

Insights

Visualizations are created to glean insights from the final_dataset. These include interactive plots for revenue by country, distribution of shows by genre, average scores by original language, and a scatter plot of budget versus revenue.

Total revenue by Country

The bar chart displays total revenue by country on a logarithmic y-axis. Countries are ordered alphabetically, which is why Australia (AU) appears as the prominent bar on the far left. On the log scale, each increment on the y-axis represents a tenfold increase in revenue; this is useful for data spanning several orders of magnitude, as it makes values of very different sizes visually comparable.
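The tenfold-per-increment behaviour of the log axis can be checked directly:

```r
revenues <- c(1e6, 1e7, 1e8, 1e9)  # values spanning three orders of magnitude
log10(revenues)                    # equally spaced on a log axis: 6, 7, 8, 9
diff(log10(revenues))              # each step is one unit, i.e. a tenfold increase
```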

Code
# Sum revenue by country
revenue_by_country <- final_dataset %>%
  group_by(country_name) %>%
  summarise(total_revenue = sum(revenue, na.rm = TRUE))

# Using a logarithmic scale on the y-axis 
interactive_plot <- plot_ly(data = revenue_by_country, x = ~country_name, y = ~total_revenue, type = 'bar',
                            marker = list(color = ~total_revenue, colorscale = 'Jet')) %>%
  layout(title = 'Total Revenue by Country (Log Scale)',
         xaxis = list(title = 'Country'),
         yaxis = list(title = 'Total Revenue (Log Scale)', type = 'log'))
interactive_plot

Distribution of shows by Genre

This visual presents the number of shows across various genres. The bars represent the count of shows in each genre, offering a clear comparison of their prevalence within the dataset. It helps stakeholders quickly gauge which genres are most common and may be driving viewership or production focus, informing decisions in content creation, marketing strategy, or market analysis.

Code
#Shows by Genre
genre_counts <- final_dataset %>%
  separate_rows(genre, sep = ",") %>%
  mutate(genre = trimws(genre)) %>%  # drop stray spaces left after splitting on commas
  count(genre, sort = TRUE)
  
interactive_bar_chart <- plot_ly(data = genre_counts, x = ~genre, y = ~n, type = 'bar', marker = list(color = 'rgba(55, 128, 191, 0.7)')) %>%
  layout(title = 'Distribution of Shows by Genre',
         xaxis = list(title = 'Genre'),
         yaxis = list(title = 'Count'))

interactive_bar_chart
Code
#  interactive pie chart of shows by Genre
plot_ly(data = genre_counts, labels = ~genre, values = ~n, type = 'pie') %>%
  layout(title = 'Distribution of Shows by Genre')

Line chart of average score by original language

This line chart displays the average scores of shows categorized by their original language. The vertical axis represents the average score, while the horizontal axis lists various languages. The average score can be influenced by the number of shows in each language category and the diversity of genres within each language. This data could be useful for content providers looking to understand audience preferences related to language or for market analysis focused on content performance across different linguistic markets.

Code
# avg score by language
avg_score_by_lang <- final_dataset %>%
  group_by(orig_lang) %>%
  summarise(average_score = mean(score, na.rm = TRUE))

# plot
plot_ly(data = avg_score_by_lang, x = ~orig_lang, y = ~average_score, type = 'scatter', mode = 'markers+lines') %>%
  layout(title = 'Average Score by Original Language',
         xaxis = list(title = 'Original Language'),
         yaxis = list(title = 'Average Score'))
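Averages from languages with only a handful of titles are noisy, so it helps to carry the group size alongside the mean. A base-R sketch, with synthetic scores standing in for `final_dataset`:

```r
# Synthetic stand-in for final_dataset's orig_lang and score columns
scores <- data.frame(orig_lang = c("en", "en", "en", "ja", "ja", "ko"),
                     score = c(7.1, 6.5, 8.0, 7.8, 8.2, 9.0))

avg <- tapply(scores$score, scores$orig_lang, mean)    # mean score per language
n   <- tapply(scores$score, scores$orig_lang, length)  # titles per language

avg_score_by_lang <- data.frame(orig_lang = names(avg),
                                average_score = as.numeric(avg),
                                n_titles = as.numeric(n))
```

The same pairing is available in the document's dplyr style via `summarise(average_score = mean(score), n_titles = n())`; either way, `n_titles` flags which language averages rest on very few observations.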

Scatter plot between Budget and Revenue

This scatter plot illustrates the relationship between the budget and revenue of various films. The horizontal axis represents the budget, while the vertical axis shows the revenue. There seems to be a general trend where films with higher budgets also tend to have higher revenues. This is evident from the spread of points trending upwards as we move right along the budget axis. This kind of analysis can be particularly valuable for film producers and investors looking to understand the potential return on investment and to make informed decisions about budget allocations for future projects. It can also prompt further investigation into what other factors contribute to a film’s commercial success.

Code
#scatter plot Budget vs Revenue
plot_ly(data = final_dataset, x = ~budget_x, y = ~revenue, type = 'scatter', mode = 'markers') %>%
  layout(title = 'Budget vs. Revenue',
         xaxis = list(title = 'Budget'),
         yaxis = list(title = 'Revenue'))
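The upward trend in the scatter can be quantified with a correlation and a simple log-log regression, whose slope reads as an elasticity. This sketch uses simulated budget/revenue pairs, not the actual dataset:

```r
set.seed(42)
budget  <- runif(200, 1e6, 2e8)                  # simulated budgets
revenue <- 3 * budget * exp(rnorm(200, 0, 0.5))  # revenue roughly proportional to budget

r <- cor(budget, revenue)  # strength of the linear association on the raw scale

# Log-log fit: the slope says a 1% budget increase goes with roughly slope% more revenue
fit <- lm(log(revenue) ~ log(budget))
slope <- unname(coef(fit)["log(budget)"])
```

On the real data, a slope near 1 would suggest revenue scales roughly proportionally with budget, while a slope above 1 would suggest bigger budgets earn disproportionately more.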

Conclusion

Production budgets and critical scores are significant predictors of a movie’s revenue potential, which can guide investment strategies in the film industry. Despite their cultural value, the thematic elements examined in movie descriptions show no direct correlation with financial success. Advanced statistical models such as GLM, GAM, and Random Forest elucidate the factors driving a film’s market performance, underscoring the role of data-driven decision-making in shaping entertainment business strategy.

References

B., Shivam. n.d. “Disney Plus Movies Dataset.” https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows.
———. n.d. “Netflix Movies Dataset.” https://www.kaggle.com/datasets/shivamb/netflix-shows.
“CRAN - Available Packages by Name.” n.d. https://cran.r-project.org/web/packages/available_packages_by_name.html.
Hastie, T. J., and R. J. Tibshirani. 1990. Generalized Additive Models. Chapman & Hall/CRC.
K., Nandhiraja. n.d. “IMDB Movies Dataset.” https://www.kaggle.com/code/nandhirajak/movie-recommendation-system.
“Leaflet: An Open-Source JavaScript Library for Mobile-Friendly Interactive Maps.” n.d. https://leafletjs.com/.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. 2nd ed. Chapman & Hall/CRC.
“Mermaid: Generation of Diagram and Flowchart from Text in a Similar Manner as Markdown.” n.d. https://mermaid-js.github.io/mermaid/.
“Quarto: An Open-Source Scientific and Technical Publishing System Built on Pandoc.” n.d. https://quarto.org/.
“R: The R Project for Statistical Computing.” n.d. https://www.r-project.org/.
Silge, J., and D. Robinson. n.d. “Text Mining with R: A Tidy Approach.” https://www.tidytextmining.com/.
Wickham, H. n.d. “Ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org/.